<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>ArXiv CS.CV Papers (Image/Video Generation) - April 25, 2025</title>
    <script src="https://cdn.tailwindcss.com"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/framer-motion/10.16.4/framer-motion.dev.js"></script>
    <!-- Example using Font Awesome (replace with your preferred icon library if needed) -->
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.2/css/all.min.css">
    <style>
        @import url('https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap');

        :root {
            /* New Palette: Light, Clean, Futuristic with Teal/Aqua accents */
            --bg-color: #f8fafc; /* Tailwind slate-50 (Very Light Gray) */
            --card-bg-color: #ffffff; /* White */
            --text-color: #1e293b; /* Tailwind slate-800 (Dark Gray-Blue) */
            --text-muted-color: #64748b; /* Tailwind slate-500 (Medium Gray-Blue) */
            --header-color: #0f172a; /* Tailwind slate-900 (Very Dark Blue) */
            --highlight-primary: #14b8a6; /* Tailwind teal-500 */
            --highlight-secondary: #67e8f9; /* Tailwind cyan-300 */
            --border-color: #e2e8f0; /* Tailwind slate-200 (Light Gray) */
            --shadow-color: rgba(15, 23, 42, 0.08); /* Subtle shadow based on slate-900 */
        }

        body {
            background-color: var(--bg-color);
            color: var(--text-color);
            font-family: 'Inter', sans-serif;
            overflow-x: hidden; /* Prevent horizontal scroll */
            line-height: 1.6;
        }

        .bento-grid {
            display: grid;
            gap: 1.5rem; /* Tailwind gap-6 */
            grid-template-columns: 1fr; /* Force single column */
            padding-bottom: 4rem; /* Add padding at the bottom */
        }

        .bento-item {
            /* Apply semi-transparent white background and blur */
            background-color: rgba(255, 255, 255, 0.7); /* White with 70% opacity */
            backdrop-filter: blur(10px); /* Apply blur effect */
            -webkit-backdrop-filter: blur(10px); /* Safari prefix */
            border-radius: 1rem; /* Slightly larger radius */
            padding: 1.75rem; /* Slightly more padding */
            border: 1px solid rgba(226, 232, 240, 0.5); /* Lighter border with transparency */
            box-shadow: 0 4px 12px var(--shadow-color);
            transition: transform 0.3s ease-out, box-shadow 0.3s ease-out, background-color 0.3s ease-out;
            overflow: hidden; /* Ensure content doesn't overflow */
            position: relative; /* For potential pseudo-elements */
        }

        /* Removed ::before pseudo-element for a cleaner look */


        .bento-item:hover {
            transform: translateY(-6px);
            box-shadow: 0 10px 20px var(--shadow-color), 0 4px 8px rgba(15, 23, 42, 0.06); /* Adjusted hover shadow */
        }

        .paper-title {
            font-size: 1.125rem; /* Tailwind text-lg */
            font-weight: 600; /* Tailwind font-semibold */
            color: var(--highlight-primary); /* Use new primary highlight */
            margin-bottom: 0.75rem; /* Tailwind mb-3 */
            line-height: 1.4;
        }

        .paper-summary {
            font-size: 0.875rem; /* Tailwind text-sm */
            color: var(--text-muted-color);
            margin-bottom: 1.25rem; /* Tailwind mb-5 */
            line-height: 1.6;
        }

        .paper-link {
            display: inline-flex; /* Use flex for icon alignment */
            align-items: center;
            font-size: 0.875rem; /* Tailwind text-sm */
            font-weight: 600;
            color: var(--highlight-primary);
            text-decoration: none;
            padding: 0.5rem 1rem; /* Add padding */
            border-radius: 0.5rem; /* Slightly rounder */
            background-color: rgba(20, 184, 166, 0.08); /* Subtle teal background */
            border: 1px solid rgba(20, 184, 166, 0.2);
            transition: background-color 0.3s ease, color 0.3s ease, transform 0.2s ease;
        }

        .paper-link i {
            margin-right: 0.5rem; /* Tailwind mr-2 */
            transition: transform 0.3s ease;
        }

        .paper-link:hover {
            background-color: rgba(20, 184, 166, 0.15);
            color: #0d9488; /* Darker teal on hover */
            transform: translateY(-1px);
        }
        .paper-link:hover i {
             transform: translateX(2px);
        }

        .paper-authors {
            font-size: 0.75rem; /* Tailwind text-xs */
            color: var(--text-muted-color);
            margin-top: 1rem; /* Tailwind mt-4 */
            font-style: italic;
        }

        .header {
            text-align: center;
            margin-bottom: 3rem; /* Tailwind mb-12 */
            padding-top: 3rem; /* Tailwind pt-12 */
        }

        .header h1 {
            font-size: 2.5rem; /* Tailwind text-4xl or 5xl */
            font-weight: 700; /* Tailwind font-bold */
            color: var(--header-color);
            letter-spacing: -0.025em; /* Tailwind tracking-tight */
            margin-bottom: 0.5rem;
            /* Optional: Add a subtle text gradient */
            /* background: linear-gradient(90deg, var(--highlight-primary), var(--highlight-secondary)); */
            /* -webkit-background-clip: text; */
            /* -webkit-text-fill-color: transparent; */
        }

        .header p {
            font-size: 1.125rem; /* Tailwind text-lg */
            color: var(--text-muted-color);
            margin-top: 0.5rem; /* Tailwind mt-2 */
            max-width: 600px;
            margin-left: auto;
            margin-right: auto;
        }

        .footer {
            text-align: center;
            color: var(--text-muted-color);
            font-size: 0.875rem; /* Tailwind text-sm */
            padding-top: 2rem;
            padding-bottom: 2rem; /* Tailwind py-8 */
            border-top: 1px solid var(--border-color);
            margin-top: 4rem;
        }

        /* Simple line graphic element (optional) */
        .line-graphic {
            height: 1px; /* Thinner line */
            background: linear-gradient(90deg, rgba(20, 184, 166, 0), var(--highlight-primary), rgba(20, 184, 166, 0));
            opacity: 0.6;
            margin: 1.5rem 0; /* Adjust margin */
        }

        /* Framer Motion requires the script, styles enhance appearance */
        [data-motion-element] {
             /* Base styles for elements animated by Framer Motion */
        }

        .paper-tldr {
            font-size: 0.95rem; /* Slightly bigger than summary */
            color: #475569; /* Changed to Tailwind slate-600 (slightly darker than summary) */
            margin-top: 0.75rem; /* Tailwind mt-3 */
            margin-bottom: 0.75rem; /* Tailwind mb-2 */
            /* font-style: italic; */
            font-weight: bold;
        }

        .paper-rating {
            margin-top: 1rem; /* Tailwind mt-4 */
            margin-bottom: 1rem; /* Tailwind mb-4 */
            color: #f59e0b; /* Tailwind amber-500 */
        }

        .paper-rating i {
            margin-right: 0.125rem; /* Tailwind mr-0.5 */
        }

        /* Apply consistent star color to sub-ratings */
        .paper-sub-ratings .rating-item i {
            color: #f59e0b; /* Match overall rating star color (amber-500) */
            margin-right: 0.125rem; /* Consistent spacing */
        }

    </style>
</head>
<body class="container mx-auto px-4 antialiased">

    <motion.div
        initial="{ opacity: 0, y: -30 }"
        animate="{ opacity: 1, y: 0 }"
        transition="{ duration: 0.6, ease: 'easeOut' }"
        class="header"
        data-motion-element
    >
        <h1>AIGC Daily Papers</h1>
        <p>Daily papers related to Image/Video/Multimodal Generation from cs.CV</p>
        <p>April 25, 2025</p>
        <div class="line-graphic mt-4 mb-8 mx-auto w-1/4"></div> <!-- Added line graphic -->
    </motion.div>

    <div class="bento-grid" id="paper-grid">
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.0, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation</h2>
            <p class="paper-summary">Subject-driven text-to-image (T2I) generation aims to produce images that
align with a given textual description, while preserving the visual identity
from a referenced subject image. Despite its broad downstream applicability --
ranging from enhanced personalization in image generation to consistent
character representation in video rendering -- progress in this field is
limited by the lack of reliable automatic evaluation. Existing methods either
assess only one aspect of the task (i.e., textual alignment or subject
preservation), misalign with human judgments, or rely on costly API-based
evaluation. To address this, we introduce RefVNLI, a cost-effective metric that
evaluates both textual alignment and subject preservation in a single
prediction. Trained on a large-scale dataset derived from video-reasoning
benchmarks and image perturbations, RefVNLI outperforms or matches existing
baselines across multiple benchmarks and subject categories (e.g.,
\emph{Animal}, \emph{Object}), achieving up to 6.4-point gains in textual
alignment and 8.5-point gains in subject consistency. It also excels with
lesser-known concepts, aligning with human preferences at over 87\% accuracy.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces refvnli, a new cost-effective metric for evaluating subject-driven text-to-image generation that assesses both textual alignment and subject preservation, demonstrating significant improvements over existing methods.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文介绍了一种新的、具有成本效益的评估指标 refvnli，用于评估主体驱动的文本到图像生成，该指标可以同时评估文本对齐和主体保持，并且比现有方法有了显著改进。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(10/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star-half-alt"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(9/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17502v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Aviv Slobodkin, Hagai Taitelbaum, Yonatan Bitton, Brian Gordon, Michal Sokolik, Nitzan Bitton Guetta, Almog Gueta, Royi Rassin, Itay Laish, Dani Lischinski, Idan Szpektor</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.05, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback</h2>
            <p class="paper-summary">Current text-to-video (T2V) generation models are increasingly popular due to
their ability to produce coherent videos from textual prompts. However, these
models often struggle to generate semantically and temporally consistent videos
when dealing with longer, more complex prompts involving multiple objects or
sequential events. Additionally, the high computational costs associated with
training or fine-tuning make direct improvements impractical. To overcome these
limitations, we introduce NeuS-E, a novel zero-training video refinement
pipeline that leverages neuro-symbolic feedback to automatically enhance video
generation, achieving superior alignment with the prompts. Our approach first
derives the neuro-symbolic feedback by analyzing a formal video representation
and pinpoints semantically inconsistent events, objects, and their
corresponding frames. This feedback then guides targeted edits to the original
video. Extensive empirical evaluations on both open-source and proprietary T2V
models demonstrate that NeuS-E significantly enhances temporal and logical
alignment across diverse prompts by almost 40%</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces neus-e, a zero-training video refinement pipeline using neuro-symbolic feedback to enhance the temporal and logical consistency of text-to-video generation, improving alignment by almost 40%.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文介绍了一种名为neus-e的零训练视频优化流程，该流程使用神经符号反馈来增强文本到视频生成的时间和逻辑一致性，并将对齐度提高了近40%。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(10/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star-half-alt"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(9/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17180v2" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Minkyu Choi, S P Sharan, Harsh Goel, Sahil Shah, Sandeep Chinchali</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.1, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models</h2>
            <p class="paper-summary">Autoregressive (AR) models, long dominant in language generation, are
increasingly applied to image synthesis but are often considered less
competitive than Diffusion-based models. A primary limitation is the
substantial number of image tokens required for AR models, which constrains
both training and inference efficiency, as well as image resolution. To address
this, we present Token-Shuffle, a novel yet simple method that reduces the
number of image tokens in Transformer. Our key insight is the dimensional
redundancy of visual vocabularies in Multimodal Large Language Models (MLLMs),
where low-dimensional visual codes from visual encoder are directly mapped to
high-dimensional language vocabularies. Leveraging this, we consider two key
operations: token-shuffle, which merges spatially local tokens along channel
dimension to decrease the input token number, and token-unshuffle, which
untangles the inferred tokens after Transformer blocks to restore the spatial
arrangement for output. Jointly training with textual prompts, our strategy
requires no additional pretrained text-encoder and enables MLLMs to support
extremely high-resolution image synthesis in a unified next-token prediction
way while maintaining efficient training and inference. For the first time, we
push the boundary of AR text-to-image generation to a resolution of 2048x2048
with gratifying generation performance. In GenAI-benchmark, our 2.7B model
achieves 0.77 overall score on hard prompts, outperforming AR models LlamaGen
by 0.18 and diffusion models LDM by 0.15. Exhaustive large-scale human
evaluations also demonstrate our prominent image generation ability in terms of
text-alignment, visual flaw, and visual appearance. We hope that Token-Shuffle
can serve as a foundational design for efficient high-resolution image
generation within MLLMs.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces token-shuffle, a method to reduce the number of image tokens in autoregressive models, enabling high-resolution (2048x2048) text-to-image generation in mllms with competitive performance compared to state-of-the-art ar and diffusion models.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文提出了一种名为 token-shuffle 的方法，旨在减少自回归模型中的图像 tokens 数量，从而使 mllm 能够在高分辨率 (2048x2048) 下生成文本到图像，并与最先进的 ar 模型和扩散模型相比具有竞争优势。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17789v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Xu Ma, Peize Sun, Haoyu Ma, Hao Tang, Chih-Yao Ma, Jialiang Wang, Kunpeng Li, Xiaoliang Dai, Yujun Shi, Xuan Ju, Yushi Hu, Artsiom Sanakoyeu, Felix Juefei-Xu, Ji Hou, Junjiao Tian, Tao Xu, Tingbo Hou, Yen-Cheng Liu, Zecheng He, Zijian He, Matt Feiszli, Peizhao Zhang, Peter Vajda, Sam Tsai, Yun Fu</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.15000000000000002, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Step1X-Edit: A Practical Framework for General Image Editing</h2>
            <p class="paper-summary">In recent years, image editing models have witnessed remarkable and rapid
development. The recent unveiling of cutting-edge multimodal models such as
GPT-4o and Gemini2 Flash has introduced highly promising image editing
capabilities. These models demonstrate an impressive aptitude for fulfilling a
vast majority of user-driven editing requirements, marking a significant
advancement in the field of image manipulation. However, there is still a large
gap between the open-source algorithm with these closed-source models. Thus, in
this paper, we aim to release a state-of-the-art image editing model, called
Step1X-Edit, which can provide comparable performance against the closed-source
models like GPT-4o and Gemini2 Flash. More specifically, we adopt the
Multimodal LLM to process the reference image and the user's editing
instruction. A latent embedding has been extracted and integrated with a
diffusion image decoder to obtain the target image. To train the model, we
build a data generation pipeline to produce a high-quality dataset. For
evaluation, we develop the GEdit-Bench, a novel benchmark rooted in real-world
user instructions. Experimental results on GEdit-Bench demonstrate that
Step1X-Edit outperforms existing open-source baselines by a substantial margin
and approaches the performance of leading proprietary models, thereby making
significant contributions to the field of image editing.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces step1x-edit, a new open-source image editing model that aims to match the performance of closed-source models like gpt-4o using a multimodal llm and diffusion decoder, evaluated on a new benchmark called gedit-bench.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文介绍了step1x-edit，一种新型开源图像编辑模型，旨在通过使用多模态llm和扩散解码器来匹配gpt-4o等闭源模型的性能，并在名为gedit-bench的新基准上进行了评估。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17761v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, Daxin Jiang</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.2, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Generative Fields: Uncovering Hierarchical Feature Control for StyleGAN via Inverted Receptive Fields</h2>
            <p class="paper-summary">StyleGAN has demonstrated the ability of GANs to synthesize highly-realistic
faces of imaginary people from random noise. One limitation of GAN-based image
generation is the difficulty of controlling the features of the generated
image, due to the strong entanglement of the low-dimensional latent space.
Previous work that aimed to control StyleGAN with image or text prompts
modulated sampling in W latent space, which is more expressive than Z latent
space. However, W space still has restricted expressivity since it does not
control the feature synthesis directly; also the feature embedding in W space
requires a pre-training process to reconstruct the style signal, limiting its
application. This paper introduces the concept of "generative fields" to
explain the hierarchical feature synthesis in StyleGAN, inspired by the
receptive fields of convolution neural networks (CNNs). Additionally, we
propose a new image editing pipeline for StyleGAN using generative field theory
and the channel-wise style latent space S, utilizing the intrinsic structural
feature of CNNs to achieve disentangled control of feature synthesis at
synthesis time.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: this paper introduces "generative fields," inspired by cnn receptive fields, to achieve disentangled feature control in stylegan's synthesis process using the channel-wise style latent space s enhancing image editing capabilities.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文引入了受 cnn 感受野启发的“生成场”概念，旨在通过通道层面的风格潜在空间 s 实现 stylegan 合成过程中解耦的特征控制，从而增强图像编辑能力。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17712v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Zhuo He, Paul Henderson, Nicolas Pugeault</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.25, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Beyond Labels: Zero-Shot Diabetic Foot Ulcer Wound Segmentation with Self-attention Diffusion Models and the Potential for Text-Guided Customization</h2>
            <p class="paper-summary">Diabetic foot ulcers (DFUs) pose a significant challenge in healthcare,
requiring precise and efficient wound assessment to enhance patient outcomes.
This study introduces the Attention Diffusion Zero-shot Unsupervised System
(ADZUS), a novel text-guided diffusion model that performs wound segmentation
without relying on labeled training data. Unlike conventional deep learning
models, which require extensive annotation, ADZUS leverages zero-shot learning
to dynamically adapt segmentation based on descriptive prompts, offering
enhanced flexibility and adaptability in clinical applications. Experimental
evaluations demonstrate that ADZUS surpasses traditional and state-of-the-art
segmentation models, achieving an IoU of 86.68\% and the highest precision of
94.69\% on the chronic wound dataset, outperforming supervised approaches such
as FUSegNet. Further validation on a custom-curated DFU dataset reinforces its
robustness, with ADZUS achieving a median DSC of 75\%, significantly surpassing
FUSegNet's 45\%. The model's text-guided segmentation capability enables
real-time customization of segmentation outputs, allowing targeted analysis of
wound characteristics based on clinical descriptions. Despite its competitive
performance, the computational cost of diffusion-based inference and the need
for potential fine-tuning remain areas for future improvement. ADZUS represents
a transformative step in wound segmentation, providing a scalable, efficient,
and adaptable AI-driven solution for medical imaging.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces adzus, a novel text-guided diffusion model for zero-shot diabetic foot ulcer segmentation, outperforming existing methods with high iou and dsc scores. it allows for text-guided customization, enhancing clinical applicability, though computational cost and fine-tuning needs remain.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文介绍了一种新型的文本引导扩散模型adzus，用于零样本糖尿病足溃疡分割，其性能优于现有方法，具有较高的iou和dsc分数。它允许文本引导的定制，增强了临床适用性，但计算成本和微调需求仍然存在。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17628v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Abderrachid Hamrani, Daniela Leizaola, Renato Sousa, Jose P. Ponce, Stanley Mathis, David G. Armstrong, Anuradha Godavarty</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.30000000000000004, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Fast Autoregressive Models for Continuous Latent Generation</h2>
            <p class="paper-summary">Autoregressive models have demonstrated remarkable success in sequential data
generation, particularly in NLP, but their extension to continuous-domain image
generation presents significant challenges. Recent work, the masked
autoregressive model (MAR), bypasses quantization by modeling per-token
distributions in continuous spaces using a diffusion head but suffers from slow
inference due to the high computational cost of the iterative denoising
process. To address this, we propose the Fast AutoRegressive model (FAR), a
novel framework that replaces MAR's diffusion head with a lightweight shortcut
head, enabling efficient few-step sampling while preserving autoregressive
principles. Additionally, FAR seamlessly integrates with causal Transformers,
extending them from discrete to continuous token generation without requiring
architectural modifications. Experiments demonstrate that FAR achieves
$2.3\times$ faster inference than MAR while maintaining competitive FID and IS
scores. This work establishes the first efficient autoregressive paradigm for
high-fidelity continuous-space image generation, bridging the critical gap
between quality and scalability in visual autoregressive modeling.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces fast autoregressive model (far) for efficient continuous-space image generation, achieving faster inference than previous methods while maintaining competitive image quality.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文介绍了快速自回归模型（far），用于有效的连续空间图像生成，与以前的方法相比，实现了更快的推理速度，同时保持了具有竞争力的图像质量。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.18391v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Tiankai Hang, Jianmin Bao, Fangyun Wei, Dong Chen</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.35000000000000003, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Text-to-Image Alignment in Denoising-Based Models through Step Selection</h2>
            <p class="paper-summary">Visual generative AI models often encounter challenges related to text-image
alignment and reasoning limitations. This paper presents a novel method for
selectively enhancing the signal at critical denoising steps, optimizing image
generation based on input semantics. Our approach addresses the shortcomings of
early-stage signal modifications, demonstrating that adjustments made at later
stages yield superior results. We conduct extensive experiments to validate the
effectiveness of our method in producing semantically aligned images on
Diffusion and Flow Matching model, achieving state-of-the-art performance. Our
results highlight the importance of a judicious choice of sampling stage to
improve performance and overall image alignment.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: this paper introduces a novel method for improving text-to-image alignment in denoising-based generative models by selectively enhancing the signal at critical denoising steps, achieving state-of-the-art results.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文介绍了一种新颖的方法，通过在关键去噪步骤中选择性地增强信号，来改进基于去噪的生成模型中的文本到图像对齐，从而实现最先进的结果。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17525v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Paul Grimal, Hervé Le Borgne, Olivier Ferret</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.4, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding</h2>
            <p class="paper-summary">There has been impressive progress in Large Multimodal Models (LMMs). Recent
works extend these models to long inputs, including multi-page documents and
long videos. However, the model size and performance of these long context
models are still limited due to the computational cost in both training and
inference. In this work, we explore an orthogonal direction and process long
inputs without long context LMMs. We propose Frame Selection Augmented
Generation (FRAG), where the model first selects relevant frames within the
input, and then only generates the final outputs based on the selected frames.
The core of the selection process is done by scoring each frame independently,
which does not require long context processing. The frames with the highest
scores are then selected by a simple Top-K selection. We show that this
frustratingly simple framework is applicable to both long videos and multi-page
documents using existing LMMs without any fine-tuning. We consider two models,
LLaVA-OneVision and InternVL2, in our experiments and show that FRAG
consistently improves the performance and achieves state-of-the-art
performances for both long video and long document understanding. For videos,
FRAG substantially improves InternVL2-76B by 5.8% on MLVU and 3.7% on
Video-MME. For documents, FRAG achieves over 20% improvements on MP-DocVQA
compared with recent LMMs specialized in long document understanding. Code is
available at: https://github.com/NVlabs/FRAG</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces frag, a frame selection augmented generation method for long video and document understanding, which selects relevant frames/pages before using existing lmms for output generation, demonstrating improved performance without fine-tuning.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文介绍了一种名为frag的帧选择增强生成方法，用于理解长视频和文档。该方法首先选择相关的帧/页，然后使用现有的lmm生成输出，无需微调即可提高性能。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17447v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: De-An Huang, Subhashree Radhakrishnan, Zhiding Yu, Jan Kautz</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.45, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">3DV-TON: Textured 3D-Guided Consistent Video Try-on via Diffusion Models</h2>
            <p class="paper-summary">Video try-on replaces clothing in videos with target garments. Existing
methods struggle to generate high-quality and temporally consistent results
when handling complex clothing patterns and diverse body poses. We present
3DV-TON, a novel diffusion-based framework for generating high-fidelity and
temporally consistent video try-on results. Our approach employs generated
animatable textured 3D meshes as explicit frame-level guidance, alleviating the
issue of models over-focusing on appearance fidelity at the expanse of motion
coherence. This is achieved by enabling direct reference to consistent garment
texture movements throughout video sequences. The proposed method features an
adaptive pipeline for generating dynamic 3D guidance: (1) selecting a keyframe
for initial 2D image try-on, followed by (2) reconstructing and animating a
textured 3D mesh synchronized with original video poses. We further introduce a
robust rectangular masking strategy that successfully mitigates artifact
propagation caused by leaking clothing information during dynamic human and
garment movements. To advance video try-on research, we introduce HR-VVT, a
high-resolution benchmark dataset containing 130 videos with diverse clothing
types and scenarios. Quantitative and qualitative results demonstrate our
superior performance over existing methods. The project page is at this link
https://2y7c3.github.io/3DV-TON/</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces 3dv-ton, a diffusion-based video try-on framework using textured 3d meshes for temporal consistency, and presents a new high-resolution video try-on dataset (hr-vvt). it addresses the challenge of generating high-quality and temporally consistent video try-on results with complex clothing patterns and diverse body poses.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文介绍了3dv-ton，一个基于扩散模型的视频试穿框架，它利用纹理化的3d网格来实现时间一致性， 并提出了一个新的高分辨率视频试穿数据集(hr-vvt)。它解决了在处理复杂服装图案和多样身体姿态时，生成高质量且时间一致的视频试穿结果的挑战。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17414v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Min Wei, Chaohui Yu, Jingkai Zhou, Fan Wang</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.5, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">DRC: Enhancing Personalized Image Generation via Disentangled Representation Composition</h2>
            <p class="paper-summary">Personalized image generation has emerged as a promising direction in
multimodal content creation. It aims to synthesize images tailored to
individual style preferences (e.g., color schemes, character appearances,
layout) and semantic intentions (e.g., emotion, action, scene contexts) by
leveraging user-interacted history images and multimodal instructions. Despite
notable progress, existing methods -- whether based on diffusion models, large
language models, or Large Multimodal Models (LMMs) -- struggle to accurately
capture and fuse user style preferences and semantic intentions. In particular,
the state-of-the-art LMM-based method suffers from the entanglement of visual
features, leading to Guidance Collapse, where the generated images fail to
preserve user-preferred styles or reflect the specified semantics.
  To address these limitations, we introduce DRC, a novel personalized image
generation framework that enhances LMMs through Disentangled Representation
Composition. DRC explicitly extracts user style preferences and semantic
intentions from history images and the reference image, respectively, to form
user-specific latent instructions that guide image generation within LMMs.
Specifically, it involves two critical learning stages: 1) Disentanglement
learning, which employs a dual-tower disentangler to explicitly separate style
and semantic features, optimized via a reconstruction-driven paradigm with
difficulty-aware importance sampling; and 2) Personalized modeling, which
applies semantic-preserving augmentations to effectively adapt the disentangled
representations for robust personalized generation. Extensive experiments on
two benchmarks demonstrate that DRC shows competitive performance while
effectively mitigating the guidance collapse issue, underscoring the importance
of disentangled representation learning for controllable and effective
personalized image generation.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces drc, a novel personalized image generation framework that enhances lmms by disentangling style and semantic features to address the guidance collapse issue prevalent in existing methods.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文介绍了drc，一种新颖的个性化图像生成框架，通过解耦风格和语义特征来增强大型多模态模型（lmm），以解决现有方法中普遍存在的指导崩溃问题。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17349v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Yiyan Xu, Wuqiang Zheng, Wenjie Wang, Fengbin Zhu, Xinting Hu, Yang Zhang, Fuli Feng, Tat-Seng Chua</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.55, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Towards Generalized and Training-Free Text-Guided Semantic Manipulation</h2>
            <p class="paper-summary">Text-guided semantic manipulation refers to semantically editing an image
generated from a source prompt to match a target prompt, enabling the desired
semantic changes (e.g., addition, removal, and style transfer) while preserving
irrelevant contents. With the powerful generative capabilities of the diffusion
model, the task has shown the potential to generate high-fidelity visual
content. Nevertheless, existing methods either typically require time-consuming
fine-tuning (inefficient), fail to accomplish multiple semantic manipulations
(poorly extensible), and/or lack support for different modality tasks (limited
generalizability). Upon further investigation, we find that the geometric
properties of noises in the diffusion model are strongly correlated with the
semantic changes. Motivated by this, we propose a novel $\textit{GTF}$ for
text-guided semantic manipulation, which has the following attractive
capabilities: 1) $\textbf{Generalized}$: our $\textit{GTF}$ supports multiple
semantic manipulations (e.g., addition, removal, and style transfer) and can be
seamlessly integrated into all diffusion-based methods (i.e., Plug-and-play)
across different modalities (i.e., modality-agnostic); and 2)
$\textbf{Training-free}$: $\textit{GTF}$ produces high-fidelity results via
simply controlling the geometric relationship between noises without tuning or
optimization. Our extensive experiments demonstrate the efficacy of our
approach, highlighting its potential to advance the state-of-the-art in
semantics manipulation.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: this paper introduces a novel training-free geometric manipulation method (gtf) for text-guided semantic image editing in diffusion models, claiming improved generalization, modality agnosticism, and efficiency compared to existing fine-tuning-based approaches.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文提出了一种新的训练自由的几何操作方法 (gtf)，用于扩散模型中的文本引导的语义图像编辑。该方法声称与现有的基于微调的方法相比，具有更好的泛化性、模态不可知性和效率。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17269v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Yu Hong, Xiao Cai, Pengpeng Zeng, Shuai Zhang, Jingkuan Song, Lianli Gao, Heng Tao Shen</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.6000000000000001, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Towards Generalizable Deepfake Detection with Spatial-Frequency Collaborative Learning and Hierarchical Cross-Modal Fusion</h2>
            <p class="paper-summary">The rapid evolution of deep generative models poses a critical challenge to
deepfake detection, as detectors trained on forgery-specific artifacts often
suffer significant performance degradation when encountering unseen forgeries.
While existing methods predominantly rely on spatial domain analysis, frequency
domain operations are primarily limited to feature-level augmentation, leaving
frequency-native artifacts and spatial-frequency interactions insufficiently
exploited. To address this limitation, we propose a novel detection framework
that integrates multi-scale spatial-frequency analysis for universal deepfake
detection. Our framework comprises three key components: (1) a local spectral
feature extraction pipeline that combines block-wise discrete cosine transform
with cascaded multi-scale convolutions to capture subtle spectral artifacts;
(2) a global spectral feature extraction pipeline utilizing scale-invariant
differential accumulation to identify holistic forgery distribution patterns;
and (3) a multi-stage cross-modal fusion mechanism that incorporates
shallow-layer attention enhancement and deep-layer dynamic modulation to model
spatial-frequency interactions. Extensive evaluations on widely adopted
benchmarks demonstrate that our method outperforms state-of-the-art deepfake
detection methods in both accuracy and generalizability.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: this paper introduces a deepfake detection framework leveraging spatial-frequency analysis and cross-modal fusion, achieving improved accuracy and generalizability compared to existing methods.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文介绍了一种利用空间-频率分析和跨模态融合的深度伪造检测框架，与现有方法相比，该框架在准确性和泛化性方面均有所提高。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17223v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Mengyu Qiao, Runze Tian, Yang Wang</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.65, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">FashionM3: Multimodal, Multitask, and Multiround Fashion Assistant based on Unified Vision-Language Model</h2>
            <p class="paper-summary">Fashion styling and personalized recommendations are pivotal in modern
retail, contributing substantial economic value in the fashion industry. With
the advent of vision-language models (VLM), new opportunities have emerged to
enhance retailing through natural language and visual interactions. This work
proposes FashionM3, a multimodal, multitask, and multiround fashion assistant,
built upon a VLM fine-tuned for fashion-specific tasks. It helps users discover
satisfying outfits by offering multiple capabilities including personalized
recommendation, alternative suggestion, product image generation, and virtual
try-on simulation. Fine-tuned on the novel FashionRec dataset, comprising
331,124 multimodal dialogue samples across basic, personalized, and alternative
recommendation tasks, FashionM3 delivers contextually personalized suggestions
with iterative refinement through multiround interactions. Quantitative and
qualitative evaluations, alongside user studies, demonstrate FashionM3's
superior performance in recommendation effectiveness and practical value as a
fashion assistant.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces fashionm3, a multimodal fashion assistant built on a fine-tuned vision-language model, offering personalized recommendations, alternative suggestions, image generation, and virtual try-on capabilities through multiround dialogue.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文介绍了fashionm3，一个基于微调的视觉-语言模型构建的多模态时尚助手，通过多轮对话提供个性化推荐、替代建议、图像生成和虚拟试穿功能。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17826v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Kaicheng Pang, Xingxing Zou, Waikeung Wong</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.7000000000000001, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">AUTHENTICATION: Identifying Rare Failure Modes in Autonomous Vehicle Perception Systems using Adversarially Guided Diffusion Models</h2>
            <p class="paper-summary">Autonomous Vehicles (AVs) rely on artificial intelligence (AI) to accurately
detect objects and interpret their surroundings. However, even when trained
using millions of miles of real-world data, AVs are often unable to detect rare
failure modes (RFMs). The problem of RFMs is commonly referred to as the
"long-tail challenge", due to the distribution of data including many instances
that are very rarely seen. In this paper, we present a novel approach that
utilizes advanced generative and explainable AI techniques to aid in
understanding RFMs. Our methods can be used to enhance the robustness and
reliability of AVs when combined with both downstream model training and
testing. We extract segmentation masks for objects of interest (e.g., cars) and
invert them to create environmental masks. These masks, combined with carefully
crafted text prompts, are fed into a custom diffusion model. We leverage the
Stable Diffusion inpainting model guided by adversarial noise optimization to
generate images containing diverse environments designed to evade object
detection models and expose vulnerabilities in AI systems. Finally, we produce
natural language descriptions of the generated RFMs that can guide developers
and policymakers to improve the safety and reliability of AV systems.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper proposes a novel approach using adversarially guided diffusion models to generate rare failure modes (rfms) in autonomous vehicle perception systems, aiming to enhance their robustness and reliability.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文提出了一种新颖的方法，使用对抗引导扩散模型来生成自动驾驶感知系统中的罕见故障模式（rfm），旨在增强其鲁棒性和可靠性。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17179v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Mohammad Zarei, Melanie A Jutras, Eliana Evans, Mike Tan, Omid Aaramoon</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.75, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Latent Video Dataset Distillation</h2>
            <p class="paper-summary">Dataset distillation has demonstrated remarkable effectiveness in
high-compression scenarios for image datasets. While video datasets inherently
contain greater redundancy, existing video dataset distillation methods
primarily focus on compression in the pixel space, overlooking advances in the
latent space that have been widely adopted in modern text-to-image and
text-to-video models. In this work, we bridge this gap by introducing a novel
video dataset distillation approach that operates in the latent space using a
state-of-the-art variational encoder. Furthermore, we employ a diversity-aware
data selection strategy to select both representative and diverse samples.
Additionally, we introduce a simple, training-free method to further compress
the distilled latent dataset. By combining these techniques, our approach
achieves a new state-of-the-art performance in dataset distillation,
outperforming prior methods on all datasets, e.g. on HMDB51 IPC 1, we achieve a
2.6% performance increase; on MiniUCF IPC 5, we achieve a 7.8% performance
increase.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: this paper presents a new video dataset distillation technique that operates in the latent space using a variational encoder and a diversity-aware data selection strategy, achieving state-of-the-art performance.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文提出了一种新的视频数据集精馏技术，该技术在潜在空间中使用变分编码器和感知多样性的数据选择策略运行，实现了最先进的性能。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17132v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Ning Li, Antai Andy Liu, Jingran Zhang, Justin Cui</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.8, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Scene-Aware Location Modeling for Data Augmentation in Automotive Object Detection</h2>
            <p class="paper-summary">Generative image models are increasingly being used for training data
augmentation in vision tasks. In the context of automotive object detection,
methods usually focus on producing augmented frames that look as realistic as
possible, for example by replacing real objects with generated ones. Others try
to maximize the diversity of augmented frames, for example by pasting lots of
generated objects onto existing backgrounds. Both perspectives pay little
attention to the locations of objects in the scene. Frame layouts are either
reused with little or no modification, or they are random and disregard realism
entirely. In this work, we argue that optimal data augmentation should also
include realistic augmentation of layouts. We introduce a scene-aware
probabilistic location model that predicts where new objects can realistically
be placed in an existing scene. By then inpainting objects in these locations
with a generative model, we obtain much stronger augmentation performance than
existing approaches. We set a new state of the art for generative data
augmentation on two automotive object detection tasks, achieving up to
$2.8\times$ higher gains than the best competing approach ($+1.4$ vs. $+0.5$
mAP boost). We also demonstrate significant improvements for instance
segmentation.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: this paper introduces a scene-aware probabilistic location model for data augmentation in automotive object detection, demonstrating significant performance improvements compared to existing methods by realistically augmenting object layouts.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文介绍了一种场景感知的概率位置模型，用于汽车目标检测中的数据增强，通过真实地增强目标布局，相比现有方法，性能有了显著提升。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17076v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Jens Petersen, Davide Abati, Amirhossein Habibian, Auke Wiggers</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.8500000000000001, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Distilling semantically aware orders for autoregressive image generation</h2>
            <p class="paper-summary">Autoregressive patch-based image generation has recently shown competitive
results in terms of image quality and scalability. It can also be easily
integrated and scaled within Vision-Language models. Nevertheless,
autoregressive models require a defined order for patch generation. While a
natural order based on the dictation of the words makes sense for text
generation, there is no inherent generation order that exists for image
generation. Traditionally, a raster-scan order (from top-left to bottom-right)
guides autoregressive image generation models. In this paper, we argue that
this order is suboptimal, as it fails to respect the causality of the image
content: for instance, when conditioned on a visual description of a sunset, an
autoregressive model may generate clouds before the sun, even though the color
of clouds should depend on the color of the sun and not the inverse. In this
work, we show that first by training a model to generate patches in
any-given-order, we can infer both the content and the location (order) of each
patch during generation. Secondly, we use these extracted orders to finetune
the any-given-order model to produce better-quality images. Through our
experiments, we show on two datasets that this new generation method produces
better images than the traditional raster-scan approach, with similar training
costs and no extra annotations.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: this paper proposes a novel semantically-aware patch order for autoregressive image generation, improving image quality compared to raster-scan order without additional cost or annotations.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文提出了一种新的语义感知块顺序，用于自回归图像生成，与光栅扫描顺序相比，在不增加成本或注释的情况下提高了图像质量。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17069v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Rishav Pramanik, Antoine Poupon, Juan A. Rodriguez, Masih Aminbeidokhti, David Vazquez, Christopher Pal, Zhaozheng Yin, Marco Pedersoli</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.9, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">ESDiff: Encoding Strategy-inspired Diffusion Model with Few-shot Learning for Color Image Inpainting</h2>
            <p class="paper-summary">Image inpainting is a technique used to restore missing or damaged regions of
an image. Traditional methods primarily utilize information from adjacent
pixels for reconstructing missing areas, while they struggle to preserve
complex details and structures. Simultaneously, models based on deep learning
necessitate substantial amounts of training data. To address this challenge, an
encoding strategy-inspired diffusion model with few-shot learning for color
image inpainting is proposed in this paper. The main idea of this novel
encoding strategy is the deployment of a "virtual mask" to construct
high-dimensional objects through mutual perturbations between channels. This
approach enables the diffusion model to capture diverse image representations
and detailed features from limited training samples. Moreover, the encoding
strategy leverages redundancy between channels, integrates with low-rank
methods during iterative inpainting, and incorporates the diffusion model to
achieve accurate information output. Experimental results indicate that our
method exceeds current techniques in quantitative metrics, and the
reconstructed images quality has been improved in aspects of texture and
structural integrity, leading to more precise and coherent results.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: this paper introduces an encoding strategy-inspired diffusion model (esdiff) with few-shot learning for color image inpainting, using virtual masks for channel perturbations to enhance feature extraction from limited data and improve inpainting quality.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文提出了一种编码策略启发的扩散模型（esdiff），结合小样本学习用于彩色图像修复。该模型利用虚拟掩码进行通道扰动，从而在有限数据下增强特征提取并提高修复质量。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star-half-alt"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(7/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17524v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Junyan Zhang, Yan Li, Mengxiao Geng, Liu Shi, Qiegen Liu</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.9500000000000001, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation</h2>
            <p class="paper-summary">Soccer is a globally popular sporting event, typically characterized by long
matches and distinctive highlight moments. Recent advances in Multimodal Large
Language Models (MLLMs) offer promising capabilities in temporal grounding and
video understanding, soccer commentary generation often requires precise
temporal localization and semantically rich descriptions over long-form video.
However, existing soccer MLLMs often rely on the temporal a priori for caption
generation, so they cannot process the soccer video end-to-end. While some
traditional approaches follow a two-step paradigm that is complex and fails to
capture the global context to achieve suboptimal performance. To solve the
above issues, we present TimeSoccer, the first end-to-end soccer MLLM for
Single-anchor Dense Video Captioning (SDVC) in full-match soccer videos.
TimeSoccer jointly predicts timestamps and generates captions in a single pass,
enabling global context modeling across 45-minute matches. To support long
video understanding of soccer matches, we introduce MoFA-Select, a
training-free, motion-aware frame compression module that adaptively selects
representative frames via a coarse-to-fine strategy, and incorporates
complementary training paradigms to strengthen the model's ability to handle
long temporal sequences. Extensive experiments demonstrate that our TimeSoccer
achieves State-of-The-Art (SoTA) performance on the SDVC task in an end-to-end
form, generating high-quality commentary with accurate temporal alignment and
strong semantic relevance.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces timesoccer, an end-to-end mllm for generating soccer commentary, which addresses the limitations of existing methods by jointly predicting timestamps and captions for full-match videos, achieving state-of-the-art performance.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文介绍了 timesoccer，一个用于生成足球评论的端到端 mllm。它通过联合预测完整比赛视频的时间戳和标题，解决了现有方法的局限性，并实现了最先进的性能。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star-half-alt"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(7/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17365v2" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Ling You, Wenxuan Huang, Xinni Xie, Xiangyi Wei, Bangyan Li, Shaohui Lin, Yang Li, Changbo Wang</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 1.0, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Dual Prompting Image Restoration with Diffusion Transformers</h2>
            <p class="paper-summary">Recent state-of-the-art image restoration methods mostly adopt latent
diffusion models with U-Net backbones, yet still facing challenges in achieving
high-quality restoration due to their limited capabilities. Diffusion
transformers (DiTs), like SD3, are emerging as a promising alternative because
of their better quality with scalability. In this paper, we introduce DPIR
(Dual Prompting Image Restoration), a novel image restoration method that
effectivly extracts conditional information of low-quality images from multiple
perspectives. Specifically, DPIR consits of two branches: a low-quality image
conditioning branch and a dual prompting control branch. The first branch
utilizes a lightweight module to incorporate image priors into the DiT with
high efficiency. More importantly, we believe that in image restoration,
textual description alone cannot fully capture its rich visual characteristics.
Therefore, a dual prompting module is designed to provide DiT with additional
visual cues, capturing both global context and local appearance. The extracted
global-local visual prompts as extra conditional control, alongside textual
prompts to form dual prompts, greatly enhance the quality of the restoration.
Extensive experimental results demonstrate that DPIR delivers superior image
restoration performance.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces dpir, a novel image restoration method using diffusion transformers with a dual prompting mechanism (textual and global-local visual cues) to enhance restoration quality.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文介绍了dpir，一种新颖的图像修复方法，它使用扩散transformer和双重提示机制（文本提示和全局-局部视觉线索）来提高修复质量。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star-half-alt"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(7/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17825v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Dehong Kong, Fan Li, Zhixin Wang, Jiaqi Xu, Renjing Pei, Wenbo Li, WenQi Ren</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 1.05, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">PPS-Ctrl: Controllable Sim-to-Real Translation for Colonoscopy Depth Estimation</h2>
            <p class="paper-summary">Accurate depth estimation enhances endoscopy navigation and diagnostics, but
obtaining ground-truth depth in clinical settings is challenging. Synthetic
datasets are often used for training, yet the domain gap limits generalization
to real data. We propose a novel image-to-image translation framework that
preserves structure while generating realistic textures from clinical data. Our
key innovation integrates Stable Diffusion with ControlNet, conditioned on a
latent representation extracted from a Per-Pixel Shading (PPS) map. PPS
captures surface lighting effects, providing a stronger structural constraint
than depth maps. Experiments show our approach produces more realistic
translations and improves depth estimation over GAN-based MI-CycleGAN. Our code
is publicly accessible at https://github.com/anaxqx/PPS-Ctrl.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: this paper proposes a novel image-to-image translation framework leveraging stable diffusion and controlnet for generating realistic colonoscopy images from synthetic data, improving depth estimation by using per-pixel shading (pps) maps for structural constraints.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文提出了一种新颖的图像到图像翻译框架，利用 stable diffusion 和 controlnet 从合成数据生成逼真的结肠镜图像，并通过使用逐像素着色 (pps) 图进行结构约束来提高深度估计。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star-half-alt"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(7/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17067v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Xinqi Xiong, Andrea Dunn Beltran, Jun Myeong Choi, Marc Niethammer, Roni Sengupta</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 1.1, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">ePBR: Extended PBR Materials in Image Synthesis</h2>
            <p class="paper-summary">Realistic indoor or outdoor image synthesis is a core challenge in computer
vision and graphics. The learning-based approach is easy to use but lacks
physical consistency, while traditional Physically Based Rendering (PBR) offers
high realism but is computationally expensive. Intrinsic image representation
offers a well-balanced trade-off, decomposing images into fundamental
components (intrinsic channels) such as geometry, materials, and illumination
for controllable synthesis. However, existing PBR materials struggle with
complex surface models, particularly high-specular and transparent surfaces. In
this work, we extend intrinsic image representations to incorporate both
reflection and transmission properties, enabling the synthesis of transparent
materials such as glass and windows. We propose an explicit intrinsic
compositing framework that provides deterministic, interpretable image
synthesis. With the Extended PBR (ePBR) Materials, we can effectively edit the
materials with precise controls.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: this paper introduces extended pbr (epbr) materials for intrinsic image representation, enabling more realistic synthesis of transparent and high-specular surfaces in computer vision and graphics.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文介绍了扩展的pbr（epbr）材料，用于内在图像表示，从而能够在计算机视觉和图形学中更真实地合成透明和高光表面。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(6/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star-half-alt"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(7/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17062v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Yu Guo, Zhiqiang Lao, Xiyun Song, Yubin Zhou, Zongfang Lin, Heather Yu</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 1.1500000000000001, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">DPMambaIR:All-in-One Image Restoration via Degradation-Aware Prompt State Space Model</h2>
            <p class="paper-summary">All-in-One image restoration aims to address multiple image degradation
problems using a single model, significantly reducing training costs and
deployment complexity compared to traditional methods that design dedicated
models for each degradation type. Existing approaches typically rely on
Degradation-specific models or coarse-grained degradation prompts to guide
image restoration. However, they lack fine-grained modeling of degradation
information and face limitations in balancing multi-task conflicts. To overcome
these limitations, we propose DPMambaIR, a novel All-in-One image restoration
framework. By integrating a Degradation-Aware Prompt State Space Model (DP-SSM)
and a High-Frequency Enhancement Block (HEB), DPMambaIR enables fine-grained
modeling of complex degradation information and efficient global integration,
while mitigating the loss of high-frequency details caused by task competition.
Specifically, the DP-SSM utilizes a pre-trained degradation extractor to
capture fine-grained degradation features and dynamically incorporates them
into the state space modeling process, enhancing the model's adaptability to
diverse degradation types. Concurrently, the HEB supplements high-frequency
information, effectively addressing the loss of critical details, such as edges
and textures, in multi-task image restoration scenarios. Extensive experiments
on a mixed dataset containing seven degradation types show that DPMambaIR
achieves the best performance, with 27.69dB and 0.893 in PSNR and SSIM,
respectively. These results highlight the potential and superiority of
DPMambaIR as a unified solution for All-in-One image restoration.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces dpmambair, a novel all-in-one image restoration framework using a degradation-aware prompt state space model and a high-frequency enhancement block to address multiple image degradation problems with improved performance.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文介绍了 dpmambair，一种新型的 all-in-one 图像恢复框架，它使用了一种降级感知提示状态空间模型和高频增强块，以解决多种图像降级问题，并提高了性能。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(3/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star-half-alt"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(5/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17732v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Zhanwen Liu, Sai Zhou, Yuchao Dai, Yang Wang, Yisheng An, Xiangmo Zhao</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 1.2000000000000002, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting</h2>
            <p class="paper-summary">Open-world object counting leverages the robust text-image alignment of
pre-trained vision-language models (VLMs) to enable counting of arbitrary
categories in images specified by textual queries. However, widely adopted
naive fine-tuning strategies concentrate exclusively on text-image consistency
for categories contained in training, which leads to limited generalizability
for unseen categories. In this work, we propose a plug-and-play Semantic-Driven
Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the
training set to unseen categories with minimal overhead in parameters and
inference time. First, we introduce a two-stage visual prompt learning strategy
composed of Category-Specific Prompt Initialization (CSPI) and Topology-Guided
Prompt Refinement (TGPR). The CSPI generates category-specific visual prompts,
and then TGPR distills latent structural patterns from the VLM's text encoder
to refine these prompts. During inference, we dynamically synthesize the visual
prompts for unseen categories based on the semantic correlation between unseen
and training categories, facilitating robust text-image alignment for unseen
categories. Extensive experiments integrating SDVPT with all available
open-world object counting models demonstrate its effectiveness and
adaptability across three widely used datasets: FSC-147, CARPK, and PUCPR+.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper proposes a semantic-driven visual prompt tuning framework (sdvpt) for open-world object counting, improving generalization to unseen categories by leveraging semantic correlations between seen and unseen categories. sdvpt consists of category-specific prompt initialization (cspi) and topology-guided prompt refinement (tgpr).</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文提出了一种语义驱动的视觉提示调整框架（sdvpt），用于开放世界的物体计数，通过利用已见类别和未见类别之间的语义相关性，提高了对未见类别的泛化能力。sdvpt由类别特定提示初始化（cspi）和拓扑引导提示细化（tgpr）组成。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(3/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(6/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star-half-alt"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(5/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17395v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Yiming Zhao, Guorong Li, Laiyun Qing, Amin Beheshti, Jian Yang, Michael Sheng, Yuankai Qi, Qingming Huang</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 1.25, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">I-INR: Iterative Implicit Neural Representations</h2>
            <p class="paper-summary">Implicit Neural Representations (INRs) have revolutionized signal processing
and computer vision by modeling signals as continuous, differentiable functions
parameterized by neural networks. However, their inherent formulation as a
regression problem makes them prone to regression to the mean, limiting their
ability to capture fine details, retain high-frequency information, and handle
noise effectively. To address these challenges, we propose Iterative Implicit
Neural Representations (I-INRs) a novel plug-and-play framework that enhances
signal reconstruction through an iterative refinement process. I-INRs
effectively recover high-frequency details, improve robustness to noise, and
achieve superior reconstruction quality. Our framework seamlessly integrates
with existing INR architectures, delivering substantial performance gains
across various tasks. Extensive experiments show that I-INRs outperform
baseline methods, including WIRE, SIREN, and Gauss, in diverse computer vision
applications such as image restoration, image denoising, and object occupancy
prediction.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces iterative implicit neural representations (i-inrs) to improve high-frequency detail capture and noise robustness in implicit neural representations for tasks like image restoration and denoising.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文介绍了迭代隐式神经表示（i-inrs），以提高隐式神经表示在高频细节捕获和噪声鲁棒性方面的能力，适用于图像修复和去噪等任务。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(3/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star-half-alt"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(5/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17364v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Ali Haider, Muhammad Salman Ali, Maryam Qamar, Tahir Khalil, Soo Ye Kim, Jihyong Oh, Enzo Tartaglione, Sung-Ho Bae</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 1.3, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">DIMT25@ICDAR2025: HW-TSC's End-to-End Document Image Machine Translation System Leveraging Large Vision-Language Model</h2>
            <p class="paper-summary">This paper presents the technical solution proposed by Huawei Translation
Service Center (HW-TSC) for the "End-to-End Document Image Machine Translation
for Complex Layouts" competition at the 19th International Conference on
Document Analysis and Recognition (DIMT25@ICDAR2025). Leveraging
state-of-the-art open-source large vision-language model (LVLM), we introduce a
training framework that combines multi-task learning with perceptual
chain-of-thought to develop a comprehensive end-to-end document translation
system. During the inference phase, we apply minimum Bayesian decoding and
post-processing strategies to further enhance the system's translation
capabilities. Our solution uniquely addresses both OCR-based and OCR-free
document image translation tasks within a unified framework. This paper
systematically details the training methods, inference strategies, LVLM base
models, training data, experimental setups, and results, demonstrating an
effective approach to document image machine translation.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: this paper presents an end-to-end document image machine translation system using a large vision-language model, incorporating multi-task learning and perceptual chain-of-thought, and achieving state-of-the-art performance in the dimt25 competition.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文介绍了一种端到端的文档图像机器翻译系统，该系统采用大型视觉语言模型，结合了多任务学习和感知链式思考，并在 dimt25 比赛中取得了领先的成绩。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(3/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(6/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(5/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(4/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2504.17315v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Zhanglin Wu, Tengfei Song, Ning Xie, Weidong Zhang, Pengfei Li, Shuang Wu, Chong Li, Junhao Zhu, Hao Yang</p>
            
        </motion.div>
        
    </div>

    <footer class="footer">
        Generated on 2025-04-28 06:02:23 UTC. Powered by <a href="https://github.com/onion-liu" target="_blank">onion-liu</a>.
    </footer>

</body>
</html>